Add UniDiffuser model and pipeline by dg845 · Pull Request #2963 · huggingface/diffusers

dg845 · 2023-04-04T10:46:48Z

This PR implements a pipeline for the UniDiffuser model as discussed in #2857.

Model/Pipeline Description

The UniDiffuser model (paper, code) is a multi-modal model which extends the DDPM model to model all distributions relevant to a set of multi-modal data. From the paper abstract:

This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is – learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model – perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead...
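The "setting proper timesteps" idea from the abstract can be sketched roughly as below. This is a minimal illustration, not code from this PR: the helper name `timesteps_for_mode`, the mode strings, and the default of 1000 training timesteps are all assumptions made for the example. The convention shown (timestep 0 for a clean conditioning modality, the maximum timestep for a modality that is marginalized out, and a shared timestep for joint sampling) follows the paper's description.

```python
from typing import Tuple


def timesteps_for_mode(mode: str, t: int, max_t: int = 1000) -> Tuple[int, int]:
    """Return (t_img, t_text) for one denoising step at timestep t.

    Hypothetical helper: the name and the mode strings are illustrative only.
    """
    if mode == "joint":      # image and text are denoised together
        return t, t
    if mode == "text2img":   # text is clean conditioning data
        return t, 0
    if mode == "img2text":   # image is clean conditioning data
        return 0, t
    if mode == "img":        # text is pure noise, i.e. marginalized out
        return t, max_t
    if mode == "text":       # image is pure noise, i.e. marginalized out
        return max_t, t
    raise ValueError(f"unknown mode: {mode!r}")


print(timesteps_for_mode("text2img", 500))  # (500, 0)
```

The same noise prediction network is called in every case; only the pair of timesteps passed to it changes.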

In this PR, we implement an image-text UniDiffuser model as described in the paper.

Usage Examples

import requests
import torch
from PIL import Image
from io import BytesIO

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "dg845/unidiffuser-diffusers"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Joint image-text generation. The generation task is automatically inferred.
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image = sample.images[0]
text = sample.text[0]
image.save("unidiffuser_sample_joint_image.png")
print(text)

# The mode can be set manually. The following is equivalent to the above:
pipe.set_joint_mode()
sample2 = pipe(num_inference_steps=20, guidance_scale=8.0)

# Note that if you set the mode manually the pipeline will no longer attempt
# to automatically infer the mode. You can re-enable this with reset_mode().
pipe.reset_mode()

# Text-to-image generation.
prompt = "an elephant under the sea"

sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_sample_text2img_image.png")

# Image-to-text generation.
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
response = requests.get(image_url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((512, 512))

sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)

# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
sample = pipe(prompt=i2t_text, num_inference_steps=20, guidance_scale=8.0)
final_image = sample.images[0]
final_image.save("unidiffuser_image_variation_sample.png")

# Text variation can be performed with a text-to-image generation followed by an image-to-text generation:
sample = pipe(image=t2i_image, num_inference_steps=20, guidance_scale=8.0)
final_prompt = sample.text[0]
print(final_prompt)

TODO

Implement UniDiffuserModel [U-ViT (paper, code)]
Implement UniDiffuserPipeline
Script to convert UniDiffuser checkpoints to diffusers checkpoints
Upload pre-trained UniDiffuser model [see this comment for more details]
Create tests for UniDiffuserPipeline
Create documentation for UniDiffuserPipeline
Add docstrings for model and pipeline
Add usage example(s)

Discussion

(TBD)

CC

@patrickvonplaten
@nemonameless
@baofff (author on original paper, author of original code)

HuggingFaceDocBuilderDev · 2023-04-04T10:52:51Z

The documentation is not available anymore as the PR was closed or merged.

dg845 · 2023-04-04T11:03:17Z

Currently, the code in the PR isn't in a working state, and I haven't implemented tests or tested the code yet. I've opened the PR because I wanted to get some preliminary feedback on the design and code. In particular, I have the following questions:

Design Questions:

Since the image-text UniDiffuser model is capable of doing marginal text or image generation, conditional text-to-image and image-to-text generation, and joint image-text generation, I've currently implemented the __call__ method to have a mode parameter that allows the user to generate text, images, text-conditioned images, etc. as desired. I'm not sure if this fits in with the pipeline design philosophy, particularly the principle that

Every pipeline should have one and only one way to run it via a __call__ method.

Would it be better if I split UniDiffuserPipeline into separate pipelines for each generation task: e.g. UniDiffuserTextToImagePipeline, etc., akin to VersatileDiffusionTextToImagePipeline, etc.?

In particular, I would greatly appreciate some preliminary feedback on the main implemented classes: UniDiffuserPipeline, UniDiffuserModel, and UniDiffuserTextDecoder.

Questions about Tests:

Is there a guide to writing tests for the diffusers library?
1. Partial answer: I've found the transformers testing guide to be useful, and I think most of the stuff there is applicable to diffusers as well.
Is there an easy way to find small model checkpoints for testing, such as analogues to CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")?
1. The transformers testing guide suggests something like grep -r "tiny" tests/ examples/ to find examples of tiny models/pipelines/etc. for testing.
2. I think the hf-internal-testing hub page should also list all such models.

[if there is a better place to move this discussion, please let me know :) ]

patrickvonplaten · 2023-04-06T09:25:52Z

+    def __init__(
+        self,
+        tokenizer: GPT2Tokenizer,
+        text_decoder: GPT2LMHeadModel,


Can we try to separate the tokenizer and text decoder here?

diffusers should be able to load the tokenizer out of the box, you just have to define it in the pipeline, e.g. here: https://github.com/huggingface/diffusers/pull/2963/files#r1159526676

Also, we cannot pass the text_decoder here at init, as this would prevent us from using from_pretrained(...) of the model class. Could you maybe try to follow the design as done here:

diffusers/src/diffusers/pipelines/spectrogram_diffusion/continous_encoder.py

Line 29 in a9477bb

class SpectrogramContEncoder(ModelMixin, ConfigMixin, ModuleUtilsMixin):

See how we import blocks from transformers to design our own new model. I think here you could just do the following:

def __init__(self, num_layers=12, ...):
    config = GPT2Config(...)  # take all the config params from init
    self.text_decoder = GPT2LMHeadModel(config)

We then design a new checkpoint architecture for the UniDiffuserTextDecoder and upload pretrained weights for it.

Changed the design of __init__ following the example: removed tokenizer and text_decoder args, added GPT2 config args.

patrickvonplaten · 2023-04-06T09:27:22Z

+        eos = "<|EOS|>"
+        special_tokens_dict = {"eos_token": eos}
+        self.tokenizer = tokenizer
+        self.tokenizer.add_special_tokens(special_tokens_dict)


Note that we can do this directly for the uploaded tokenizer. E.g. let's just upload a tokenizer that has EOS already added so that we don't have to do it every time we call the model at init

More than happy to help here later on!

Removed the tokenizer logic from __init__, will work on uploading the appropriate tokenizer.

I've prepared some native diffusers checkpoints for the current implementation of the UniDiffuserPipeline and its building blocks (e.g. UniDiffuserModel, UniDiffuserTextDecoder, etc.) [see the convert_to_ckpt.py script]. How can I upload these to the Hub?

I was able to upload some models to the hub (see e.g. small test models here), but I'm confused about how to save/push to hub a tokenizer with added special tokens. The documentation for PreTrainedTokenizerBase.from_pretrained says that it won't save modifications to the tokenizer after initialization and I wasn't able to find any resources on how to do it after searching.

For reference, the code in the base unidiffuser library is something like

eos = '<|EOS|>'
special_tokens_dict = {'eos_token': eos}
base_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
base_tokenizer.add_special_tokens(special_tokens_dict)

Regarding uploading the weights and the new tokenizer, you can call push_to_hub() directly on the model.

So, for example (considering UniDiffuserModel is already populated with the pre-trained checkpoints):

unidiffusers = UniDiffuserModel(...)
unidiffusers.push_to_hub("your_hub_user_name/model_id")

Same applies for the rest of the models and the tokenizer.

patrickvonplaten · 2023-04-06T09:27:54Z

+        self.transformer = text_decoder
+        # TODO: need to set the eos_token_id correctly
+        self.transformer.config.eos_token_id = self.tokenizer.eos_token_id
+        self.transformer.resize_token_embeddings(len(self.tokenizer))


We can also make sure that the GPT2Transformer has the correct number of word embeddings before loading it so that we don't have to always resize the embedding every time at init

+1. I would prefer to have the decoder with the rejigged embeddings on the Hub rather than rejigging on the fly.

patrickvonplaten · 2023-04-06T09:32:36Z

+        return generated_captions
+
+    @torch.no_grad()
+    def generate_beam(


works for me!

patrickvonplaten · 2023-04-06T09:33:47Z

+    """
+
+    @register_to_config
+    def __init__(


The design here looks good to me! Note that I think we can remove some redundant code that is not needed for this use case. I think you only need one of the three cases:

self.is_input_continuous = (in_channels is not None) and (patch_size is None)
self.is_input_vectorized = num_vector_embeds is not None
self.is_input_patches = in_channels is not None and patch_size is not None

Should have removed most of the redundant code (kept only the code handling the patch input case, since that's what the original UniDiffuser implementation used).

patrickvonplaten · 2023-04-06T09:43:06Z

Great first design! I left some comments directly in the code. In short I think the general design is very nice - the models should be defined under the pipeline folder just like you do and the pipeline also looks quite nice already.

Answering your questions in line

Currently, the code in the PR isn't in a working state, and I haven't implemented tests or tested the code yet. I've opened the PR because I wanted to get some preliminary feedback on the design and code. In particular, I have the following questions:

Design Questions:

Since the image-text UniDiffuser model is capable of doing marginal text or image generation, conditional text-to-image and image-to-text generation, and joint image-text generation, I've currently implemented the __call__ method to have a mode parameter that allows the user to generate text, images, text-conditioned images, etc. as desired. I'm not sure if this fits in with the pipeline design philosophy, particularly the principle that

Every pipeline should have one and only one way to run it via a __call__ method.

Would it be better if I split UniDiffuserPipeline into separate pipelines for each generation task: e.g. UniDiffuserTextToImagePipeline, etc., akin to VersatileDiffusionTextToImagePipeline, etc.?

I think since the purpose of UniDiffusers is exactly to bring all modes into the same distribution, one pipeline is nice here. So this design works for me. I'd maybe just not have a "mode" call input, but instead automatically decide the mode depending on what the user puts in. E.g. if the user just passes a "text" input, we're in text2img mode; if just an "image" input, we're in image-to-text mode => would this design work, or are the inputs not enough to define which mode one is in? E.g. are multiple modes possible for the same input combination?

In particular, I would greatly appreciate some preliminary feedback on the main implemented classes: UniDiffuserPipeline, UniDiffuserModel, and UniDiffuserTextDecoder.

Left comments mostly directly in the code. In short:
UniDiffuserPipeline - looks good already, just:

let's make the tokenizer directly an input
if possible remove the mode input or if we can't make it maybe a setter variable pipe.set_text_to_image()

UniDiffuserModel - looks good, let's just remove all code that we don't need

UniDiffuserTextDecoder - here we need to change the init design a bit so that it would work flawlessly with from_pretrained(...), e.g. we can have models such as gpt2lmhead in the init (left some comments directly in the code)

Questions about Tests:

Is there a guide to writing tests for the diffusers library?

Partial answer: I've found the transformers testing guide to be useful, and I think most of the stuff there is applicable to diffusers as well.

Is there an easy way to find small model checkpoints for testing, such as analogues to CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")?

Not really. Some guides that could help:

https://github.com/huggingface/diffusers/blob/main/CONTRIBUTING.md
Just look at other tests, maybe here you can get inspired by the versatile diffusion tests: https://github.com/huggingface/diffusers/tree/main/tests/pipelines/versatile_diffusion

The transformers testing guide suggests something like grep -r "tiny" tests/ examples/ to find examples of tiny models/pipelines/etc. for testing.

I think the hf-internal-testing hub page should also list all such models.

Regarding tiny models, yeah we just create them ourselves. What you can do here is to just load tiny configs to create random tiny models and use those for faster testing :-)

[if there is a better place to move this discussion, please let me know :) ]

Hope this helps a bit so that you can move forward, let me know if you need more help!

dg845 · 2023-04-07T02:33:51Z

Thanks for the review! With regards to this:

E.g. if the user just passes a "text" input, we're in text2img mode; if just an "image" input, we're in image-to-text mode => would this design work, or are the inputs not enough to define which mode one is in? E.g. are multiple modes possible for the same input combination?

for the currently supported modes, there is some ambiguity when neither text nor image input is provided. In this case, we cannot be sure whether the user wants unconditional ("marginal") image generation, unconditional ("marginal") text generation, or joint image-text generation.

The original code additionally supports image variation ("img2text2img") and text variation ("text2img2text") modes, whose inputs would be the same as the image-to-text (a conditioning image) and text-to-image (a conditioning prompt) modes, respectively. So supporting these modes would also cause some ambiguity.

So perhaps we could infer the mode in __call__, with e.g. only text input defaulting to the text2img mode and providing neither text nor image input defaulting to img mode. We would also provide setter variables pipe.set_text_to_image(), pipe.set_text_variation(), etc. and if the user uses the setter variables, we would respect that mode over the inferred mode.
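The proposal above (infer the mode from the inputs, but let an explicitly set mode take precedence) might look roughly like the following. This is only a sketch: the class `ModeSelector`, its method names, and the mode strings are hypothetical and are not the pipeline's actual API. The fallback to `img` when no input is given follows this comment; the usage example in the PR description shows the no-input case resolving to joint generation instead. Raising when both inputs are given is an assumption of the sketch.

```python
from typing import Any, Optional


class ModeSelector:
    """Hypothetical stand-in for the pipeline's mode handling."""

    VALID_MODES = {"text2img", "img2text", "joint", "img", "text"}

    def __init__(self) -> None:
        self.mode: Optional[str] = None  # None means "infer from the inputs"

    def set_mode(self, mode: str) -> None:
        if mode not in self.VALID_MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        self.mode = mode

    def reset_mode(self) -> None:
        # Re-enable automatic inference.
        self.mode = None

    def resolve(self, prompt: Optional[Any] = None, image: Optional[Any] = None) -> str:
        # A manually set mode always wins over the inferred one.
        if self.mode is not None:
            return self.mode
        if prompt is not None and image is None:
            return "text2img"
        if image is not None and prompt is None:
            return "img2text"
        if prompt is None and image is None:
            return "img"  # ambiguous case: fallback chosen per the comment above
        raise ValueError("cannot infer a mode when both prompt and image are given")


selector = ModeSelector()
print(selector.resolve(prompt="an elephant under the sea"))  # text2img
selector.set_mode("joint")
print(selector.resolve(prompt="an elephant under the sea"))  # joint
```

Keeping the override in a single attribute means `reset_mode()` is all that is needed to return to automatic inference.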

[Just as a side note, the image variation implementation is different between StableDiffusionImageVariationPipeline and UniDiffuser. The Stable Diffusion image variation model is essentially a Stable Diffusion model trained with the text encoder swapped out for a CLIP-based image encoder (see here), but UniDiffuser uses its ability to do both image-conditioned text generation and text-conditioned image generation to do a "round-trip translation" of the image into text and back to an image. Not sure this is relevant to the discussion, just something I found interesting :). ]

[Edit: pushed new commit with possible implementation as described above]

nemonameless · 2023-04-07T13:59:49Z

I have referenced some of your code, combined it with mine, and also submitted an initial version as PR PaddlePaddle/PaddleNLP#5487. I hope we can learn from each other and contribute to the community.

dg845 · 2023-04-08T02:51:19Z

Hi @patrickvonplaten and @baofff,

In looking at the noise prediction model architecture, I'm using BasicTransformerBlock as my transformer block, which I've noticed has two main differences as compared to the Block implementation in the original code:

BasicTransformerBlock is pre-LayerNorm, while Block is post-LayerNorm.
Block has the LayerNorms on the residual backbone of the block, whereas BasicTransformerBlock does not. (That is, in Block, the layer norm is applied after the skip connections, rather than before.)

In light of this, I have the following questions:

Should we expect a big difference in performance between the two implementations for inference? (The paper reports that using a pre-LayerNorm transformer is numerically unstable when training a UniDiffuser model.)
Should I follow the original implementation as closely as possible?
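To make the difference between the two block layouts concrete, here is a small pure-Python sketch. It is not the code of `BasicTransformerBlock` or of the original `Block`: the function names are made up, the layer norm has no learnable scale or shift, and a toy `double` function stands in for the attention or feed-forward sublayer.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    # Normalize to zero mean and (approximately) unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    # Pre-LayerNorm: the norm sits inside the residual branch,
    # so the skip path carries x through unchanged.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    # Post-LayerNorm: the norm is applied after the skip connection,
    # so it sits on the residual backbone itself.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def double(v: Vector) -> Vector:
    # Toy sublayer standing in for attention or a feed-forward network.
    return [2.0 * a for a in v]


x = [1.0, 2.0, 3.0, 4.0]
print(pre_ln_block(x, double))
print(post_ln_block(x, double))
```

In the post-LayerNorm variant every block's output is re-normalized, while in the pre-LayerNorm variant the un-normalized input is passed straight along the skip path, which is why weights trained with one layout cannot simply be loaded into the other.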

dg845 · 2023-04-15T02:50:45Z

As a note, if you want to look at the code I used to calculate the expected_slices for the fast default tests, you can look at https://github.com/dg845/unidiffuser/blob/test_sampling/sample_test_v1.py.
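For readers unfamiliar with the pattern, an expected-slice check compares a few output values against hard-coded reference numbers within a tolerance. The sketch below is a plain-Python illustration of that idea; the helper names, the tolerance, and all numbers are placeholders and are not taken from the UniDiffuser tests or the linked script.

```python
from typing import Sequence


def max_abs_diff(actual: Sequence[float], expected: Sequence[float]) -> float:
    # Largest element-wise deviation between the two slices.
    if len(actual) != len(expected):
        raise ValueError("slices must have the same length")
    return max(abs(a - e) for a, e in zip(actual, expected))


def slices_match(actual: Sequence[float], expected: Sequence[float], atol: float = 1e-3) -> bool:
    return max_abs_diff(actual, expected) < atol


# Placeholder numbers, not values from the real tests.
expected_slice = [0.50, 0.62, 0.48]
close_slice = [0.5004, 0.6197, 0.4801]
far_slice = [0.50, 0.70, 0.48]

print(slices_match(close_slice, expected_slice))  # True
print(slices_match(far_slice, expected_slice))    # False
```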

patrickvonplaten · 2023-04-21T17:53:52Z

Very cool! This looks almost ready to be merged to me - thanks a lot for re-iterating on the design :-)

patrickvonplaten · 2023-04-21T17:54:10Z

@williamberman @sayakpaul when you have a moment, it'd be super cool if you could review

sayakpaul · 2023-04-24T10:19:22Z

+        stop_token: str = "<|EOS|>",
+    ):
+        """
+        Generates text using the given tokenizer and text prompt or token embedding via beam search.


Help me understand this a bit. Why would there be a need to generate text from a given text prompt?

Sorry, I think I wrote this docstring in a confusing way. In the context of UniDiffuser sampling, we use this function to generate output text (when appropriate) from the text latents after we process the CLIP-embedded input prompt using the unet (UniDiffuserModel) model. The method accepts both prompt and embed arguments, for input tokens and embeddings respectively, but we only ever call it with input embeddings (as described above):

diffusers/src/diffusers/pipelines/unidiffuser/modeling_text_decoder.py

Line 135 in 2fdb7b8

generated_captions.append(self.generate_beam(tokenizer, embed=feature, device=device)[0])

Oh okay. Yeah, then I guess we need to make it a bit clearer from the code?

sayakpaul · 2023-04-24T10:26:50Z

+            labels (`torch.Tensor`, *optional*):
+                TODO
+        """
+        embedding_text = self.transformer.transformer.wte(tokens)


Why transformer.transformer?

I took the forward(...) method from the original code. Upon reviewing the code, I think the forward method was probably intended to do the following: the tokens argument is a sequence of input vocab token IDs for the GPT2LMHeadModel, while the prefix argument is the hidden state of another model (e.g. something like transformers.modeling_outputs.BaseModelOutputWithPooling.last_hidden_state of a CLIPTextModel). prefix then gets converted to an intermediate representation via self.encode_prefix(...) and then converted into the latent space of the GPT model via self.decode_prefix(...) (if they are being used). We then combine the embedding of tokens with the prefix embedding and then do a forward pass of the internal GPT2LMHeadModel.

I guess it's confusing currently because on lines 52-54 instead of using n_embd as the input dimension to nn.Linear we should instead have a new argument prefix_inner_dim and use that, e.g.

self.encode_prefix = (
    nn.Linear(prefix_inner_dim, self.prefix_hidden_dim)
    if self.prefix_hidden_dim is not None
    else nn.Identity()
)

Furthermore, prefix_hidden_dim should probably always be supplied, since prefix_inner_dim and n_embd are in general not guaranteed to be the same.

Thanks a lot for explaining!

I guess we will know better once we start testing the code.

sayakpaul · 2023-04-24T10:42:06Z

Design looks quite clean and mature to me. My questions / comments are very minor ones and are probably already covered by @patrickvonplaten's comments.

I think since the purpose of UniDiffusers is exactly to bring all modes into the same distribution, one pipeline is nice here. So this design works for me. I'd maybe just not have a "mode" call input, but instead automatically decide the mode depending on what the user puts in. E.g. if the user just passes a "text" input, we're in text2img mode; if just an "image" input, we're in image-to-text mode => would this design work, or are the inputs not enough to define which mode one is in? E.g. are multiple modes possible for the same input combination?

+1 to this.

dg845 · 2023-04-25T09:53:54Z

I've uploaded a diffusers version of the unidiffuser-v1 checkpoint at https://huggingface.co/dg845/unidiffuser-diffusers and a small random testing pipeline at https://huggingface.co/dg845/unidiffuser-diffusers-test. Note that the text_encoder is from openai/clip-vit-large-patch14 and the image_encoder and image_processor are from openai/clip-vit-base-patch32, which should match the frozen CLIP encoders used by the original implementation. text_tokenizer should have the new EOS token added, and text_decoder should have its embeddings appropriately resized for the new token.

I've also opened a PR at hf-internal-testing/diffusers-images in the hub to add an example image for UniDiffuserPipeline testing.

sayakpaul · 2023-04-25T10:01:54Z

This is great! Thanks so much for your efforts. I think now the TODOs are:

Tests and make sure they pass.
Transfer https://huggingface.co/dg845/unidiffuser-diffusers and https://huggingface.co/dg845/unidiffuser-diffusers-test to appropriate repositories. For the first, it would be something like https://hf.co/thu-ml and for the second it will be https://hf.co/hf-internal-testing. Note that you will still retain the authorship :)
Fill out the model card for the https://huggingface.co/dg845/unidiffuser-diffusers.
Update docs in this PR.

I have also merged your PR. So, hopefully, this unblocks you. @patrickvonplaten can help us with the repo transfers.

patrickvonplaten · 2023-04-28T08:50:31Z

Let me know once you need help with a model transfer

sayakpaul · 2023-05-25T06:38:58Z

+| Pipeline | Tasks | Demo
+|---|---|:---:|
+| [UniDiffuserPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_unidiffuser.py) | *Joint Image-Text Gen*, *Text-to-Image*, *Image-to-Text*, *Image Gen*, *Text Gen*, *Image Variation*, *Text Variation* |  |


Suggested change

-| Pipeline | Tasks | Demo
-|---|---|:---:|
-| [UniDiffuserPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_unidiffuser.py) | *Joint Image-Text Gen*, *Text-to-Image*, *Image-to-Text*, *Image Gen*, *Text Gen*, *Image Variation*, *Text Variation* | |
+| Pipeline | Tasks | Demo | Colab |
+|---|---|:---:|:---:|
+| [UniDiffuserPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_unidiffuser.py) | *Joint Image-Text Gen*, *Text-to-Image*, *Image-to-Text*, *Image Gen*, *Text Gen*, *Image Variation*, *Text Variation* | [🤗 Spaces](https://huggingface.co/spaces/thu-ml/unidiffuser) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/unidiffuser.ipynb) |

For now, let's add a link to the original demo. @hysts is working on changing the demo to use diffusers.

Prepared a Colab Notebook from your awesome documentation: huggingface/notebooks#377

Also prepared this GIF to showcase the power of the pipeline:

sayakpaul · 2023-05-25T06:41:59Z

+import requests
+import torch
+from PIL import Image
+from io import BytesIO
+
+from diffusers import UniDiffuserPipeline
+
+device = "cuda"
+model_id_or_path = "thu-ml/unidiffuser-v1"
+pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
+pipe.to(device)
+
+# Image-to-text generation
+image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
+response = requests.get(image_url)
+init_image = Image.open(BytesIO(response.content)).convert("RGB")
+init_image = init_image.resize((512, 512))
+
+sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
+i2t_text = sample.text[0]
+print(text)


Suggested change

-import requests
-import torch
-from PIL import Image
-from io import BytesIO
-
-from diffusers import UniDiffuserPipeline
-
-device = "cuda"
-model_id_or_path = "thu-ml/unidiffuser-v1"
-pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
-pipe.to(device)
-
-# Image-to-text generation
-image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
-response = requests.get(image_url)
-init_image = Image.open(BytesIO(response.content)).convert("RGB")
-init_image = init_image.resize((512, 512))
-
-sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
-i2t_text = sample.text[0]
-print(text)
+import torch
+
+from diffusers import UniDiffuserPipeline
+from diffusers.utils import load_image
+
+device = "cuda"
+model_id_or_path = "thu-ml/unidiffuser-v1"
+pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
+pipe.to(device)
+
+# Image-to-text generation
+image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
+init_image = load_image(image_url).resize((512, 512))
+
+sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
+i2t_text = sample.text[0]
+print(i2t_text)

Reduces the LoC :)

sayakpaul · 2023-05-25T06:42:43Z

+# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
+# 1. Image-to-text generation
+image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
+response = requests.get(image_url)
+init_image = Image.open(BytesIO(response.content)).convert("RGB")
+init_image = init_image.resize((512, 512))


I guess we can follow the same as https://github.com/huggingface/diffusers/pull/2963/files#r1205061596 for loading and resizing the image?

sayakpaul · 2023-05-25T06:44:32Z

+	- all
+	- __call__
+
+## ImageTextPipelineOutput


This should go here:

https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/outputs.mdx

Also wanted to know if there's any argument to control the number of images / texts I want to generate as part of the variation modes.

I can control the num_images_per_prompt in the text-to-image mode, so that's settled. But what about text variation?

For modes which generate only text (img2text and text), there's an analogous num_prompts_per_image argument to __call__ . So when you perform the second img2text generation for text variation you can specify num_prompts_per_image > 1 to get multiple text variation samples.

This should go here:

https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/outputs.mdx

It feels more natural to me to have the documentation for ImageTextPipelineOutput alongside ImagePipelineOutput, which is at the Diffusion Pipeline doc page.

I've gone ahead and moved the ImageTextPipelineOutput documentation to /api/diffusion_pipeline.mdx (alongside the ImagePipelineOutput and AudioPipelineOutput documentation). Let me know if it would be better somewhere else (for example, at /api/outputs.mdx as originally suggested) :).

I understand. But we will soon update that too :)

Cc: @patrickvonplaten

I see, so would it be better if I move it to /api/outputs? Or is it fine to leave it at /api/diffusion_pipeline for now?

Let's keep it as is for now. Then we will bulk move things :)

sayakpaul · 2023-05-25T08:50:47Z

+
+sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
+i2t_text = sample.text[0]
+print(text)


Nit: should be i2t_text.

sayakpaul · 2023-05-25T09:29:21Z

+
+### Unconditional Image and Text Generation
+
+Unconditional generation (where we start from only latents sampled from a standard Gaussian prior) from a `UniDiffuserPipeline` will produce a (image, text) pair:


Suggested change

-Unconditional generation (where we start from only latents sampled from a standard Gaussian prior) from a `UniDiffuserPipeline` will produce a (image, text) pair:
+Unconditional generation (where we start from only latents sampled from a standard Gaussian prior) from a [`UniDiffuserPipeline`] will produce a (image, text) pair:

So that the hyperlink is automatically rendered.

sayakpaul · 2023-05-25T09:30:14Z

+print(text)
+```
+
+The `img2text` mode requires that an input `image` be supplied. You can set the `img2text` mode manually with [`UniDiffuser.set_image_to_text_mode`].


Suggested change

-The `img2text` mode requires that an input `image` be supplied. You can set the `img2text` mode manually with [`UniDiffuser.set_image_to_text_mode`].
+The `img2text` mode requires that an input `image` be supplied. You can set the `img2text` mode manually with [`UniDiffuserPipeline.set_image_to_text_mode`].

…fuser.mdx to /api/diffusion_pipeline.mdx.

patrickvonplaten · 2023-05-25T21:00:00Z

+        self.transformer = GPT2LMHeadModel(gpt_config)
+
+    def forward(
+        self,


patrickvonplaten · 2023-05-25T21:00:14Z

+        return self.encode_prefix(prefix)
+
+    @torch.no_grad()
+    def generate_captions(self, features, eos_token_id, device):


patrickvonplaten · 2023-05-25T21:01:21Z

+        eos_token_id: Optional[int] = None,
+        input_ids=None,
+        input_embeds=None,
+        device=None,
+        beam_size: int = 5,
+        entry_length: int = 67,
+        temperature: float = 1.0,


Suggested change

-        eos_token_id: Optional[int] = None,
-        input_ids=None,
-        input_embeds=None,
-        device=None,
-        beam_size: int = 5,
-        entry_length: int = 67,
-        temperature: float = 1.0,
+        input_ids=None,
+        input_embeds=None,
+        device=None,
+        beam_size: int = 5,
+        entry_length: int = 67,
+        temperature: float = 1.0,
+        eos_token_id: Optional[int] = None,

(nit) Let's change the order here maybe since the eos_token_id should probably not be the first input

Changed :).

patrickvonplaten · 2023-05-25T21:01:54Z

+        cross_attention_kwargs=None,
+        class_labels=None,
+    ):
+        # Pre-LayerNorm


patrickvonplaten

Great, this PR looks good to go for me! Just one final tiny nit regarding the ordering of the generate input.

Apart from this, this is good to merge from my side :-) Incredible work here @dg845! This is really a difficult model with many components and the final implementation is super nice :-)

sayakpaul · 2023-05-26T01:16:27Z

@dg845 once the conflicts are resolved and tests pass, we will merge :)

Meanwhile, I will also correct the gif.

Really amazing contribution. I hope the contribution experience was enjoyable for you.

sayakpaul · 2023-05-26T01:18:15Z

@patrickvonplaten a friendly ping for these transfers:

#2963 (comment)

dg845 · 2023-05-26T01:32:46Z

Thanks! I really enjoyed working on this PR :). And thanks for all the advice and help along the way :).

patrickvonplaten · 2023-05-26T10:01:30Z

@sayakpaul feel free to merge whenever! All good from my side

sayakpaul · 2023-05-26T19:57:57Z

@dg845 thanks again for your amazing contribution. The pipeline and the components are now live at: https://huggingface.co/docs/diffusers/main/en/api/pipelines/unidiffuser

dg845 mentioned this pull request Apr 4, 2023

Any plan to support UniDiffuser? #2857

Closed

patrickvonplaten reviewed Apr 6, 2023

Comment thread src/diffusers/pipelines/unidiffuser/modeling_uvit.py Outdated

patrickvonplaten reviewed Apr 6, 2023

Comment thread src/diffusers/pipelines/unidiffuser/pipeline_unidiffuser.py

patrickvonplaten reviewed Apr 6, 2023

patrickvonplaten requested review from sayakpaul and williamberman April 21, 2023 17:54

sayakpaul reviewed Apr 24, 2023

Comment thread src/diffusers/pipelines/unidiffuser/modeling_text_decoder.py

sayakpaul reviewed Apr 24, 2023

ernestchu and others added 4 commits May 5, 2023 07:22

Fix a bug of pano when not doing CFG (huggingface#3030)

115e382

* Fix a bug of pano when not doing CFG
* enhance code quality
* apply formatting.

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

Text2video zero refinements (huggingface#3070)

10c54cb

* fix progress bar issue in pipeline_text_to_video_zero.py. Copy scheduler after first backward
* fix tensor loading in test_text_to_video_zero.py
* make style && make quality

Release: v0.15.0

945f300

[Tests] Speed up panorama tests (huggingface#3067)

322b5cb

* fix: norm group test for UNet3D.
* chore: speed up the panorama tests (fast).
* set default value of _test_inference_batch_single_identical.
* fix: batch_sizes default value.

sayakpaul reviewed May 25, 2023

sayakpaul mentioned this pull request May 25, 2023

[Diffusers] add: colab notebook on unidiffuser huggingface/notebooks#377

Merged

sayakpaul reviewed May 25, 2023

dg845 added 2 commits May 25, 2023 03:04

Make improvements to the documentation.

d4b11aa

Move ImageTextPipelineOutput documentation from /api/pipelines/unidiffuser.mdx to /api/diffusion_pipeline.mdx.

98ce17d

patrickvonplaten reviewed May 25, 2023

patrickvonplaten approved these changes May 25, 2023

dg845 added 2 commits May 25, 2023 18:05

Change order of arguments for UniDiffuserTextDecoder.generate_beam.

f8c325a

make style

b4feac8

Merge branch 'main' into unidiffuser-pipeline

4f21661

sayakpaul reviewed May 26, 2023

Comment thread docs/source/en/api/pipelines/unidiffuser.mdx Outdated

Update docs/source/en/api/pipelines/unidiffuser.mdx

07d68d7

sayakpaul merged commit 352ca31 into huggingface:main May 26, 2023

dg845 deleted the unidiffuser-pipeline branch May 27, 2023 02:09

patrickvonplaten changed the title ~~[WIP] Add UniDiffuser model and pipeline~~ Add UniDiffuser model and pipeline May 30, 2023


### Unconditional Image and Text Generation

Unconditional generation (where we start from only latents sampled from a standard Gaussian prior) from a `UniDiffuserPipeline` will produce an (image, text) pair:

